In this lecture, we'll get into more detail on Python variables, as well as language syntax. By the end, you should be able to:
This week will effectively be a "crash-course" in Python basics; there's a lot of ground to cover!
We saw in the last lecture how to define variables, as well as a few of the basic variable "types" available in Python. It's important to keep in mind that each variable you define has a "type", and this type will dictate much (if not all) of the operations you can perform on and with that variable.
To recap: a variable in Python is a sort of placeholder that stores a value. Critically, a variable has both a name and a type. For example:
In [ ]:
x = 2
It's easy to determine the name of the variable; in this case, the name is x. It can be a bit more complicated to determine the type of the variable, as it depends on the value the variable is storing. In this case, it's storing the number 2. Since there's no decimal point on the number, we call this number an integer, or int for short.
Thus, in this case, the name of the variable is x and the type is int.
In [12]:
y = 2.0
In this example, since y is assigned a value of 2.0, it is referred to as a floating-point variable, or float for short. It doesn't matter that the decimal is 0; internally, Python sees the explicit presence of a decimal and treats the variable y as having type float.
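If you're ever unsure which type Python picked, the built-in type() function will tell you. The names count and ratio below are purely illustrative; this is just a small sketch showing that the decimal point alone decides between int and float.

```python
# Illustrative names; any variable names would do.
count = 2      # no decimal point, so Python stores an int
ratio = 2.0    # explicit decimal point, so Python stores a float

print(type(count))     # <class 'int'>
print(type(ratio))     # <class 'float'>

# The two values are numerically equal even though their types differ.
print(count == ratio)  # True
```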
Floats do the heavy-lifting of much of the computation in data science. Whenever you're computing probabilities or fractions or normalizations, floats are the types of variables you're using. In general, you tend to use floats for heavy computation, and ints for counting things.
There is an explicit connection between ints and floats. Let's illustrate with an example:
In [13]:
x = 2
y = 3
z = x / y
In this case, we've defined two variables x and y and assigned them integer values, so they are both of type int. However, we've used them both in a division operation and assigned the result to a variable named z. If we were to check the type of z, what type do you think it would be?
z is a float!
In [14]:
type(z)
Out[14]:
float
How does that happen? Shouldn't an operation involving two ints produce an int? For addition, subtraction, and multiplication, yes it does. Division is the exception: in Python 3, the / operator always produces a float, even when the answer is a whole number (4 / 2 gives 2.0). Python implicitly "promotes" the result to a float. This kind of type conversion is known as casting, and it can take two forms: implicit casting (as we just saw), or explicit casting.
Implicit casting is done in such a way as to try to abide by "common sense": if you're dividing two numbers, you would all but expect to receive a fraction, or decimal, on the other end. If you're multiplying two numbers, the type of the output depends on the types of the inputs: two floats multiplied will produce a float, two ints multiplied will produce an int, and an int multiplied by a float will produce a float.
In [15]:
x = 2
y = 3
z = x * y
type(z)
Out[15]:
int
In [16]:
x = 2.5
y = 3.5
z = x * y
type(z)
Out[16]:
float
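To see the promotion rules side by side, here is a small sketch. The variable names are made up for illustration, and the // operator (floor division) is included only as a contrast to /; it isn't otherwise covered in this lecture.

```python
a = 7
b = 2

true_div = a / b     # 3.5  -> / always gives a float in Python 3
even_div = 4 / 2     # 2.0  -> still a float, even though it divides evenly
floor_div = a // b   # 3    -> // on two ints rounds down and stays an int
mixed = a * 1.5      # 10.5 -> mixing an int and a float gives a float

print(type(true_div), type(even_div), type(floor_div), type(mixed))
```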
Explicit casting, on the other hand, is a little trickier. In this case, it's you, the programmer, who makes explicit (hence the name) what type you want your variables to be. Python has a couple of special built-in functions for performing explicit casting on variables, and they're named what you would expect: int() for casting a variable as an int, and float() for casting it as a float.
In [17]:
x = 2.5
y = 3.5
z = x * y
print("Float z:\t{}\nInteger z:\t{}".format(z, int(z)))
Whoa! What's going on here?
With explicit casting, you are telling Python to override its default behavior. In doing so, it has to make some decisions as to how to do so in a way that still makes sense. When you cast a float to an int, some information is lost; namely, the decimal. So the way Python handles this is by quite literally discarding the entire decimal portion, with no rounding.
In this way, even if your number was 9.999999999 and you performed an explicit cast with int(), Python would hand you back a 9.
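A short sketch of the difference between truncating and rounding, using made-up variable names. The built-in round() function is shown only for contrast; it's what you'd reach for if you wanted the nearest whole number instead.

```python
value = 9.999999999
negative = -2.7

truncated = int(value)          # 9: the decimal part is simply dropped
rounded = round(value)          # 10: rounds to the nearest whole number
truncated_neg = int(negative)   # -2: truncation goes toward zero, not down to -3

print(truncated, rounded, truncated_neg)
```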
Python as a language is known as dynamically typed. This means you don't have to specify the type of the variable when you define it; rather, Python infers the type from the value you assign to it. As we've already seen, Python creates a variable of type int when you assign it an integer number like 5, and a variable that receives the result of a division ends up with type float.
Other languages, like C++ and Java, are statically typed, meaning in addition to naming a variable when it is declared, the programmer must also explicitly state the type of the variable.
There are pros and cons to both dynamic and static typing; in particular, some would argue that it's easier to make mistakes in dynamically typed languages, as one isn't always 100% certain of what Python's underlying type representation is without explicitly checking the type of the variable. On the other hand, not having to declare types every time you define a new variable can eliminate a lot of boilerplate code.
In fact, Python follows a philosophy known as duck typing: if it walks like a duck and quacks like a duck, it's a duck. Rather than checking what type a variable was declared to be before performing an operation, Python simply attempts the operation; any value that supports that operation will work, whatever its type happens to be.
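Here's a minimal sketch of duck typing in action. The function describe_length is a hypothetical helper invented for this example, and it uses a list, which we haven't formally covered yet. The point is only that the function never asks what type it was given; it just tries to call len() on it.

```python
def describe_length(thing):
    """Hypothetical helper: works on anything that supports len()."""
    return "length is {}".format(len(thing))

print(describe_length("duck"))      # a string supports len()
print(describe_length([1, 2, 3]))   # so does a list

try:
    describe_length(42)             # an int does not
except TypeError as err:
    print("Not a duck:", err)
```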
This brings us to a concept known as type safety. This is an important point, especially in dynamically typed languages where the type is not explicitly set by the programmer: there are countless examples of nefarious hacking that has exploited a lack of type safety in certain applications in order to execute malicious code.
A particularly fun example is known as a roundoff error, or more specifically to our case, a representation error. This occurs when we attempt to represent a value that we simply don't have enough precision to store accurately. This gets a little technical, but basically a float is stored in a fixed number of bits (typically 64): one for the sign, some for an exponent, and the rest for the significant digits. With only finitely many bits available, most decimal values, even simple ones like 0.1, can only be approximated.
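The classic demonstration of representation error takes only a few lines. As a sketch of one common workaround, the example also uses math.isclose() from the standard library, which compares floats within a small tolerance instead of demanding exact equality.

```python
import math

total = 0.1 + 0.2

print(total)                      # 0.30000000000000004
print(total == 0.3)               # False: the stored values differ slightly
print(math.isclose(total, 0.3))   # True: equal within a small tolerance
```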
For example, let's say we wanted to build an algorithm that automatically predicts whether an email we received is spam or not. To do this, we have to multiply a lot of probabilities together. Probabilities are floats between 0 and 1. Now imagine multiplying several hundreds to thousands of these values together; we'll very likely end up with tiny, tiny numbers. In the case of an underflow error, Python may very well set our total probability to 0 as it won't have enough bits to represent the full decimal.
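To make the underflow concrete, here is a rough sketch with made-up numbers: 1000 words, each with a probability of 0.01. It uses a loop, which we'll cover properly later. A common workaround, shown alongside, is to add up the logarithms of the probabilities instead of multiplying the probabilities themselves; this is offered as one standard option, not the only one.

```python
import math

word_probability = 0.01   # made-up value for illustration
num_words = 1000          # made-up value for illustration

product = 1.0
log_sum = 0.0
for _ in range(num_words):
    product *= word_probability            # shrinks toward zero
    log_sum += math.log(word_probability)  # stays a manageable number

print(product)   # 0.0: the true value (1e-2000) is too small for a float
print(log_sum)   # roughly -4605.17: the same information, safely stored
```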
One of the best-known examples of this kind of error was the Y2K bug. In this case, many older systems and programs internally stored the year as simply the last two digits. Thus, when the year 2000 rolled around, the two digits representing the year rolled over to 00, which was indistinguishable from 1900. A similar problem is anticipated for 2038, when systems that store time as a signed 32-bit count of seconds since January 1, 1970 (the Unix convention) will overflow and wrap around to a negative number, which corresponds to a date in 1901.
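Python's own integers grow as large as needed, so they won't overflow like this. To illustrate the 2038 problem anyway, the sketch below borrows a fixed-size 32-bit integer from the standard ctypes module; the name MAX_INT32 is just a label chosen for this example.

```python
import ctypes
from datetime import datetime, timezone

MAX_INT32 = 2**31 - 1   # largest value a signed 32-bit integer can hold

# The last moment a signed 32-bit seconds counter can represent.
last_moment = datetime.fromtimestamp(MAX_INT32, tz=timezone.utc)
print(last_moment)      # 2038-01-19 03:14:07+00:00

# One second later, a 32-bit counter wraps around to a large negative number.
wrapped = ctypes.c_int32(MAX_INT32 + 1).value
print(wrapped)          # -2147483648
```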
In these cases, and especially in dynamically typed languages like Python, it is very important to know what types of variables you're working with and what the limitations of those types are.
In [18]:
x = "this is a string"
type(x)
Out[18]:
str
Unlike numerical types like ints and floats, you can't really perform arithmetic operations on strings, with one main exception:
In [19]:
x = "some string"
y = "another string"
z = x + " " + y
print(z)
The + operator, when applied to strings, is called string concatenation. This means, quite literally, that it glues or concatenates two strings together to create a new string. In this case, we took the string in x, concatenated it to a single space " ", and concatenated that again to the string in y, storing the whole thing in a final string z.
Other than the + operator (and *, which repeats a string when you multiply it by an integer), the other arithmetic operations aren't defined for strings, so I wouldn't recommend trying them...
In [20]:
s = "2"
t = "divisor"
x = s / t  # raises TypeError: unsupported operand type(s) for /: 'str' and 'str'
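For reference, here is a small sketch of which operators strings do and don't support. The variable names are illustrative, and the try/except block (a construct we haven't covered yet) is there only so the failing division doesn't halt the rest of the example.

```python
greeting = "ab"

joined = greeting + "cd"   # "abcd": concatenation
repeated = greeting * 3    # "ababab": repetition with an integer

try:
    result = "2" / "divisor"
except TypeError as err:
    result = None
    print("Division failed:", err)

print(joined, repeated)
```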
Casting, however, is alive and well with strings. In particular, if you know the string you're working with is a string representation of a number, you can cast it from a string to a numeric type:
In [21]:
s = "2"
x = int(s)
print("x = {} and has type {}.".format(x, type(x)))
And back again:
In [22]:
x = 2
s = str(x)
print("s = {} and has type {}.".format(s, type(s)))
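One caveat worth seeing: the cast only works if the string really looks like the target type. In this sketch (illustrative names again), int() refuses a string containing a decimal point, so one option is to go through float() first.

```python
whole = int("42")         # 42
decimal = float("2.5")    # 2.5

try:
    bad = int("2.5")      # int() will not parse a decimal string
except ValueError as err:
    bad = None
    print("Could not convert:", err)

via_float = int(float("2.5"))   # 2: parse as a float, then truncate
print(whole, decimal, via_float)
```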
Strings also have some useful methods that numeric types don't for doing some basic text processing.
In [23]:
s = "Some string with WORDS"
print(s.upper()) # make all the letters uppercase
print(s.lower()) # make all the letters lowercase
A very useful method that will come in handy later in the course when we do some text processing is strip(). Often when you're reading text from a file and splitting it into tokens, you're left with strings that have leading or trailing whitespace:
In [25]:
s1 = " python "
s2 = " python"
s3 = "python "
Anyone who looked at these three strings would say they're the same, but the whitespace before and after the word python in each of them results in Python treating them each as unique. Thankfully, we can use the strip() method:
In [27]:
print("|" + s1.strip() + "|")
print("|" + s2.strip() + "|")
print("|" + s3.strip() + "|")
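As a quick sketch of why this matters for comparisons, the example below checks equality before and after stripping. It also shows lstrip() and rstrip(), the one-sided relatives of strip(); the variable names are made up for illustration.

```python
padded = "  python  "

both = padded.strip()     # "python": whitespace removed from both ends
left = padded.lstrip()    # "python  ": only leading whitespace removed
right = padded.rstrip()   # "  python": only trailing whitespace removed

print(padded == "python")   # False: the whitespace counts
print(both == "python")     # True
```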
You can also delimit strings using either single-quotes or double-quotes. Either is fine and largely depends on your preference.
In [ ]:
s = "some string"
t = 'this also works'
Python also has a built-in function len() that can be used to return the length of a string. The length is simply the number of individual characters (including any whitespace) in the string.
In [41]:
s = "some string"
len(s)
Out[41]:
11
We can also compare variables! By comparing variables, we can ask whether two things are equal, or greater than or less than some other value. This sort of true-or-false comparison gives rise to yet another type in Python: the boolean type. A variable of this type takes only two possible values: True or False.
Let's say we have two numeric variables, x and y, and want to check if they're equal. To do this, we use a variation of the assignment operator:
In [28]:
x = 2
y = 2
x == y
Out[28]:
True
Hooray! The == sign is the equality comparison operator, and it will return True or False depending on whether or not the two values are exactly equal. This works for strings as well:
In [30]:
s1 = "a string"
s2 = "a string"
s1 == s2
Out[30]:
True
In [31]:
s3 = "another string"
s1 == s3
Out[31]:
False
We can also ask if variables are less than or greater than each other, using the < and > operators, respectively.
In [32]:
x = 1
y = 2
x < y
Out[32]:
True
In [33]:
x > y
Out[33]:
False
With a small twist on these comparisons, we can also ask if something is less than or equal to, or greater than or equal to, some other value. To do this, we add an equal sign after the comparison operators < or >, giving <= and >=:
In [35]:
x = 2
y = 3
x <= y
Out[35]:
True
In [36]:
x = 3
x <= y
Out[36]:
True
In [37]:
x = 3.00001
x <= y
Out[37]:
False
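Since a comparison produces a value of the boolean type, you can store its result in a variable like any other value. A brief sketch, with is_at_most as an illustrative name:

```python
x = 3
y = 3

is_at_most = x <= y
print(is_at_most)         # True
print(type(is_at_most))   # <class 'bool'>

x = 3.00001
is_at_most = x <= y       # re-run the comparison with the new value of x
print(is_at_most)         # False
```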
Interestingly, these operators also work for strings. Be careful, though: their behavior may be somewhat unexpected until you figure out what trick is actually happening:
In [39]:
s1 = "some string"
s2 = "another string"
s1 > s2
Out[39]:
True
In [40]:
s1 = "Some string"
s1 > s2
Out[40]:
False
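The trick is that Python compares strings character by character using each character's numeric code, and every uppercase letter has a smaller code than every lowercase letter. The sketch below uses the built-in ord() function to reveal those codes; lowering both strings first is one possible way to get a case-insensitive comparison.

```python
print(ord("s"), ord("a"), ord("S"))   # 115 97 83

lower_first = "some string" > "another string"   # True:  115 > 97
upper_first = "Some string" > "another string"   # False: 83 < 97

# Converting both sides to lowercase removes the surprise.
ignoring_case = "Some string".lower() > "another string".lower()   # True

print(lower_first, upper_first, ignoring_case)
```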
There are some rules regarding what can and cannot be used as a variable name. Beyond those rules, there are guidelines.
All the letters a-z (upper and lowercase), the numbers 0-9, and underscores are at your disposal. No special characters like pound signs, dollar signs, or percents are allowed. (Python 3 does technically accept many non-English letters as well, but sticking to the characters above is the safe choice.) Hashtag alphanumerics only.
Numbers cannot be the first character of a variable name. message_1 is a perfectly valid variable name; however, 1_message is not and will raise a SyntaxError.
Underscores are how Python programmers tend to "simulate" spaces in variable names, but simply put there's no way to name a variable with multiple words separated by spaces.
This might take some trial-and-error. Basically, if you try to name a variable print or float or str, you'll run into a lot of problems down the road, because your variable replaces the built-in function of the same name. Technically this isn't outlawed in Python, but it will cause a lot of headaches later in your program.
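You can ask Python itself whether a name is legal. This sketch uses the string method isidentifier() and the standard keyword module. The function shadowing_demo is a hypothetical helper written only to show, safely contained inside a function, what happens when a variable takes over the name str.

```python
import keyword

print("message_1".isidentifier())    # True
print("1_message".isidentifier())    # False: starts with a number
print("my message".isidentifier())   # False: contains a space

print(keyword.iskeyword("for"))      # True: reserved, can never be a variable name
print(keyword.iskeyword("print"))    # False: legal to reuse, but a bad idea

def shadowing_demo():
    """Hypothetical helper: reuse the name str, then try to call str()."""
    str = "oops"   # inside this function, str is now just a string
    try:
        str(5)
    except TypeError as err:
        return "str() is broken here: {}".format(err)

print(shadowing_demo())
```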
I've been giving a lot of examples using variables named x, s, and so forth. This is bad. Don't do it--unless, for example, you're defining x and y to be points in a 2D coordinate axis, or as a counter; one-letter variable names for counters are quite common.
Outside of those narrow use-cases, the variable names should constitute a pithy description that reflects their function in your program. A variable storing a name, for example, could be name or even student_name, but don't go as far as to use the_name_of_the_student.
Avoid lowercase l or uppercase O. This is one of those annoying rules that largely only applies to one-letter variables: stay away from using letters that also bear striking resemblance to numbers. Naming your variable l or O may confuse downstream readers of your code, making them think you're sprinkling 1s and 0s throughout your code.
Java programmers may take umbrage at this point: the convention there is to use camelCase for multi-word variable names. Since Python takes quite a bit from the C language (and its standard implementation is written in C), it also borrows a lot of C conventions, one of which is to use underscores and all lowercase letters in variable names.
The one exception to this rule is when you define variables that are constant; that is, their values don't change. In this case, the variable name is usually in all-caps. For example: PI = 3.14159.
In [42]:
# Adds two numbers that are initially strings by converting them to an int and a float,
# then converting the final result to an int and storing it in the variable x.
x = int(int("1345") + float("31.5"))
print(x)
Comments are important to good coding style and should be used often for clarification. However, even more preferable to the liberal use of comments is a good variable naming convention. For instance, instead of naming a variable "x" or "y" or "c", give it a name that describes its purpose.
In [44]:
str_length = len("some string")
I could've used a comment to explain how this variable was storing the length of the string, but by naming the variable itself in terms of what it was doing, I don't even need such a comment. It's self-evident from the name itself what this variable is doing.
Whitespace (no, not that Whitespace) is important in the Python language. Some languages like C++ and Java use semicolons to delineate the end of a single statement. Python, however, does not, but still needs some way to identify when we've reached the end of a statement.
In Python, it's the return key (a newline) that denotes the end of a statement. Returns, tabs, and spaces are all collectively known as "whitespace", and each can drastically change how your Python program runs. Especially when we get into loops, conditionals, and functions, this will become critical and may be the source of many insidious bugs.
For example, the following code won't run:
In [45]:
x = 5
    x += 10
Python sees the indentation--it's important to Python in terms of delineating blocks of code--but in this case the indentation doesn't make any sense. It doesn't highlight a new function, or a conditional, or a loop. It's just "there", making it unexpected and hence causing the error.
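If you'd like to see the error without crashing your notebook, one option is to hand the code to Python as a string. This sketch relies on the built-in compile() function, which checks the code without running it; the variable names source and error_name are just for illustration.

```python
# Two statements, the second with stray indentation.
source = "x = 5\n    x += 10\n"

try:
    compile(source, "<example>", "exec")
    error_name = None
except IndentationError as err:
    error_name = type(err).__name__
    print(error_name, "-", err.msg)   # IndentationError - unexpected indent
```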
This can be particularly pernicious when writing longer Python programs, full of functions and loops and conditionals, where the indentation of your code is constantly changing. For this reason, I am giving you the following mandate:
DO NOT MIX TABS AND SPACES!!!
If you're indenting your code using 2 spaces, ALWAYS USE SPACES.
If you're indenting your code using 4 spaces, ALWAYS USE SPACES.
If you're indenting your code with a single tab, ALWAYS USE TABS.
Mixing the two in the same file will cause ALL THE HEADACHES. Your code will crash but will be coy as to the reason why.
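As a rough illustration of what mixing looks like to Python, the sketch below builds a tiny program whose first indented line uses a tab (written \t) and whose second uses eight spaces. In a text editor the two lines can look identical, yet Python 3 refuses to compile them. As before, the names source and error_name are made up for this example.

```python
# Line 2 is indented with a tab, line 3 with eight spaces.
source = "if True:\n\tx = 1\n        y = 2\n"

try:
    compile(source, "<example>", "exec")
    error_name = None
except TabError as err:
    error_name = type(err).__name__
    print(error_name, "-", err.msg)
```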
Some questions to discuss and consider:
1: Multiply 2.1 and 3.1 in Python. What do you get? Why?
2: Let's say you want to know how many words are in a document (like the review question in the last lecture). What type of variable would we use to store that value, and why?
3: What does len() return for s = "string"? How about s = " string "? How about s = " string ".strip()?
4: Give an example of a variable name that is used to store the average area of a group of squares. What type would this variable be?
5: I'm opening up my favorite text editor to write a Python script. Should I configure it to use tabs or spaces?